Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

The available annotation column names are displayed with the following (Figure 5.7):

columns(org.Hs.eg.db)

Each of these annotation columns holds the corresponding value for every gene annotated
in the reference sequence. We can select the annotation columns that we need and add
them to the DGEList object as an annotation slot. The following script maps the gene
symbols in the count data to Entrez IDs, sets the Entrez IDs as the row names, selects
the annotation columns for those genes, adds the annotation as a slot to the DGEList
object, and finally removes any row without an Entrez ID:

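# Map the gene symbols (row names of y) to Entrez IDs; unmapped symbols give NA
ENTREZID <- mapIds(org.Hs.eg.db, keys=rownames(y),
                   keytype="SYMBOL", column="ENTREZID")
# Use the Entrez IDs as the row names of the count matrix
rownames(y$counts) <- ENTREZID
# Retrieve the annotation columns for those Entrez IDs
ann <- select(org.Hs.eg.db, keys=rownames(y$counts),
              columns=c("ENTREZID","SYMBOL","GENENAME"))
head(ann)
# Store the annotation in the "genes" slot of the DGEList object
y$genes <- ann
# Remove the rows that have no Entrez ID
i <- is.na(y$genes$ENTREZID)
y <- y[!i, ]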

Figure 5.8 shows the annotation slot “genes”, which holds the Entrez IDs, gene symbols,
and gene names mapped to the rows of the count data “counts”.
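
The map-and-filter logic can be tried on a toy example without any annotation package. The following is only a minimal sketch in base R: the count matrix, the sample names "s1" and "s2", the placeholder symbol "NOT_A_GENE", and the lookup vector symbol2entrez are all hypothetical and stand in for the real count data and for the result of mapIds(). Unlike the script above, the sketch drops the unmapped rows before renaming, so that no row name is ever NA.

```r
# Toy count matrix keyed by gene symbol (hypothetical data, not the book's).
# "NOT_A_GENE" is a placeholder symbol that cannot be mapped.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("TP53", "BRCA1", "NOT_A_GENE", "EGFR"),
                                 c("s1", "s2")))

# Hypothetical lookup vector standing in for mapIds(org.Hs.eg.db, ...).
symbol2entrez <- c(TP53 = "7157", BRCA1 = "672", EGFR = "1956")

# Symbols missing from the lookup come back as NA.
entrez <- unname(symbol2entrez[rownames(counts)])

# Keep only the rows that have an Entrez ID, then use the IDs as row names.
keep <- !is.na(entrez)
counts <- counts[keep, , drop = FALSE]
rownames(counts) <- entrez[keep]
counts
```

The row for "NOT_A_GENE" is removed and the three remaining rows are keyed by Entrez ID.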

5.3.7.3 Design Matrix

The design matrix contains dummy variables that encode the covariates of the model; its
columns depend on the study design and on the research questions to be answered. We will define the

FIGURE 5.8 Adding annotation to the count data.

FIGURE 5.7 Annotation columns available on “org.Hs.eg.db”.
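
To illustrate what the dummy variables of a design matrix look like, the following minimal sketch builds one with the base R function model.matrix(). The six samples and the group labels "control" and "treated" are assumptions made for illustration only and are not the design used in this chapter.

```r
# Hypothetical sample grouping: three control and three treated samples.
group <- factor(c("control", "control", "control",
                  "treated", "treated", "treated"))

# Model without an intercept: one 0/1 dummy column per group level.
design <- model.matrix(~ 0 + group)

# Replace the default column names (groupcontrol, grouptreated) with the levels.
colnames(design) <- levels(group)
design
```

Each row corresponds to a sample and each column to a group; an entry of 1 marks the group that the sample belongs to.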